QC, integration, regression, clustering, and data exploration of scRNAseq samples collected over a time course of zebrafish caudal fin regeneration. Raw data from Hou et al., Science Advances, 2020.

set directory to folder where raw count matrices are

load libraries

shortcut to load the object generated by this notebook

load count matrices. in this case, features are named with ensembl ids, not gene names, so i created a custom function to wrangle and rename features (source code can be found in my mebfunctions repo).

generate QC metrics and figs - i built a custom function to do this in a consistent way and produce a nice figure, but you could also do it manually. source code for this function available in the mebfunctions repo. - we can see that there are slightly more genes and molecules detected in the d2 and d4 datasets, but %mito and log10genes/umi metrics are similar. - based on %mito and RNA and feature counts, it looks like this data has already been filtered, but the code below is what i would normally suggest for filtering.

integrate datasets from four timepoints: - normalize data (takes feature counts, divides by total counts, multiplies by scaling factor, natural log) - find variable features (identifies genes with high variability between cells) - select integration features (ranks all features by the number of datasets in which they are highly variable) - find anchors (essentially, this performs the following for all combinations of datasets: dimensional reduction, identification of pairs of cells that are within each other’s neighborhoods, filtering of low-confidence anchors by checking to see if query cell is first nieghbors with reference cell at max dimensionality, scores remaining anchors by computing overlap between reference anchor neighbors and query anchor neighbors, and then rescales to reduce outlier effects) - integrate data (iterative pairwise integration - orders integration based on pairs of datasets with low cell#/anchor ratios, creates a matrix of weights between query cells and anchors based on distance between cell and anchor, distance between cell and k.weightth anchor, and the anchor score computed previously. then, computes difference between expression matrices of ever pair of anchor cells, computes product of this matrix and the weights matrix, subtracts that product from the original expression matrix)

perform linear dimensional reduction, compute neighbors and clusters, visualize and choose resolution

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 16334
## Number of edges: 581950
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9669
## Number of communities: 7
## Elapsed time: 1 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 16334
## Number of edges: 581950
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9315
## Number of communities: 14
## Elapsed time: 2 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 16334
## Number of edges: 581950
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9107
## Number of communities: 14
## Elapsed time: 2 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 16334
## Number of edges: 581950
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8881
## Number of communities: 18
## Elapsed time: 1 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 16334
## Number of edges: 581950
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8748
## Number of communities: 22
## Elapsed time: 1 seconds

score cells based on cell cycle (optional, but i find exploring this very useful, especially in data sets where we know there are proliferation differences between conditions, like this one)

annotate the clusters based on domain gene expression knowledge - here i’m using gene expression patterns that are well characterized in the regenerating tailfin to define cluster identity. you could also find marker genes enriched in each cluster in order to identify them.

plot the data seperated by timepoint (in this case, to visualize clearly how populations change over the course of regeneration, and the desparity in cell number between early regeneration data set and d2 and d4)